Identifying the individuals most susceptible to a disease allows resources to be allocated productively. For diseases such as dementia, individuals both inherit risk factors and accrue them throughout life. No factor is causative on its own, but understanding what contributes to a high risk allows the public health sector to assess and prevent potential health crises. Dementia is a clinical syndrome characterized by difficulties in memory and language, psychological and psychiatric changes, and impairments in activities of daily living (Burns and Iliffe 2009). Dementia’s complex list of possible symptoms is reflected in its causes. Common origins of dementia are degenerative neurological diseases such as Parkinson’s or Alzheimer’s; however, vascular disorders in the brain, traumatic head injuries and some infections can also lead to a dementia diagnosis.
The data used in the analysis was obtained from a longitudinal study of 150 participants. Participants were right-handed, either male or female, and aged between 60 and 96. They were characterized as nondemented, demented or converted (became demented during the course of the study). In each session, participants underwent T1-weighted MRI scans, the results of which are recorded in the data set. Participants attended two or more sessions, each separated by at least a year.
This workflow addresses two questions: which factors are associated with an increased risk of dementia, and which factors are associated with an increased risk over time? It is important to note that no single determinant causes dementia. The profiles of two people diagnosed with dementia may be completely different.
The workflow is produced with R, a statistical computing language (R Core Team 2020), and R Markdown, which generates this HTML report (Allaire et al. 2020). The bookdown package is used to add features to R Markdown such as cross-referencing (Xie 2016).
Data is imported into R using the tidyverse (Wickham et al. 2019) and readxl (Wickham and Bryan 2019) packages.
The raw data is held in two Excel sheets within the same workbook, dementia.xlsx. One sheet, visit data, contains information on the number of visits and MRIs as well as numerical results. The second sheet, patient data, contains information on current dementia status, sex, education and social status. Each row is one patient’s data at one given time. Subject IDs are repeated because some patients had data collected once a year over multiple years. Explanations of each column can be seen in 2.1.
| Term | Definition |
|---|---|
| MMSE | Mini-Mental State Examination score (range: 0 = worst to 30 = best). A 30-point questionnaire used to measure cognitive impairment. A score above 24 is considered normal. Lower scores may correlate with dementia, although this is not true in every case. |
| CDR | Clinical Dementia Rating (0 = no impairment, 0.5 = questionable, 1 = mild, 2 = moderate, 3 = severe). A clinical tool that measures relative dementia symptoms based on 6 domains (memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care) |
| eTIV | Estimated total intracranial volume (mm3) |
| nWBV | Normalized whole-brain volume (%) |
| ASF | Atlas Scaling Factor (unitless) |
| M_F | Patient sex. Female is represented by a 1, male is represented by a 2 |
| EDUC | Years of Education |
| SES | Socioeconomic status, assessed by the Hollingshead Four Factor Index of Social Status, which measures the social status of an individual based on 4 domains: marital status, retired/employed status, educational attainment, and occupational prestige. A score of 1 indicates the highest status, while 5 indicates the lowest status |
Three data sets were created from the raw data. Each one starts by merging visit_data and patient_data into one set by subject ID. After import and merging, variable names are cleaned with the janitor (Firke 2020) package. From here they differ as described below:
Dementia: used to look at which factors are associated with an increased risk of dementia. Columns not used in the analysis (subject_id, visit, group and mri_number) and rows with NA values are removed. Finally, the values in m_f are converted to numerical values for analysis (F = 1, M = 2).
Dementia2: used to look at factors contributing to dementia over time. This data set was intended for use in either a paired Student’s t-test or a paired-samples Wilcoxon test. The cdr and mri_number columns and rows with NA values are removed, and the rows are arranged in ascending order of visit number. The values in m_f are converted to numerical values for analysis (F = 1, M = 2). Rows whose group is Nondemented or Demented are removed, as these statuses did not change over time. Only visits 1 and 2 are kept, because most subjects had no data for visit 3 or higher. The OAS2_ prefix is removed from the subject_id strings. Subjects whose subject_id appears only once are removed, as they have no pair. Finally, the visit levels are ordered, purely so that start and end appear in order in the boxplots.
Dementia_extract: used to generate some of the values used for inline reporting. In this data set all repeated subject_ids are removed so that the number of participants can be reported accurately.
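The preparation steps described above can be sketched as follows. This is a minimal illustration rather than the report's actual code: it uses base R instead of the tidyverse and janitor so that it runs without extra packages, the two data frames are invented toy stand-ins for the sheets in dementia.xlsx, the raw column names (`Subject ID`, `MRI Number`, `M/F`) are assumptions, the `clean()` helper only approximates what `janitor::clean_names()` does, and the `start`/`end` labels for the visit levels are hypothetical. "Repeated subject_ids removed" is interpreted here as keeping the first row per subject.

```r
# Toy stand-ins for the two sheets (invented values, assumed raw column names)
visit_data <- data.frame(
  `Subject ID` = c("OAS2_0001", "OAS2_0001", "OAS2_0002", "OAS2_0003", "OAS2_0003"),
  Visit        = c(1, 2, 1, 1, 2),
  `MRI Number` = c(1, 2, 1, 1, 2),
  CDR          = c(0, 0, 0.5, 0, 0.5),
  check.names  = FALSE
)
patient_data <- data.frame(
  `Subject ID` = c("OAS2_0001", "OAS2_0002", "OAS2_0003"),
  Group        = c("Nondemented", "Demented", "Converted"),
  `M/F`        = c("M", "F", "F"),
  SES          = c(2, NA, 3),
  check.names  = FALSE
)

# Common first step: merge by subject ID, then clean the variable names
merged <- merge(visit_data, patient_data, by = "Subject ID")
clean <- function(x) {            # simplified approximation of janitor::clean_names()
  x <- gsub("[^a-z0-9]+", "_", tolower(x))
  gsub("^_|_$", "", x)
}
names(merged) <- clean(names(merged))

# Dementia: drop unused columns and NA rows, recode sex (F = 1, M = 2)
unused   <- c("subject_id", "visit", "group", "mri_number")
dementia <- na.omit(merged[, !(names(merged) %in% unused)])
dementia$m_f <- ifelse(dementia$m_f == "F", 1, 2)

# Dementia2: converted subjects only, visits 1 and 2, complete pairs
dementia2 <- na.omit(merged[, !(names(merged) %in% c("cdr", "mri_number"))])
dementia2 <- dementia2[order(dementia2$visit), ]
dementia2$m_f <- ifelse(dementia2$m_f == "F", 1, 2)
dementia2 <- dementia2[dementia2$group == "Converted" & dementia2$visit %in% c(1, 2), ]
dementia2$subject_id <- sub("^OAS2_", "", dementia2$subject_id)
paired_ids <- names(which(table(dementia2$subject_id) == 2))
dementia2  <- dementia2[dementia2$subject_id %in% paired_ids, ]
dementia2$visit <- factor(dementia2$visit, levels = c(1, 2), labels = c("start", "end"))

# Dementia_extract: one row per participant, for inline reporting
dementia_extract <- merged[!duplicated(merged$subject_id), ]
n_participants   <- nrow(dementia_extract)
```

In an R Markdown document, a value such as `n_participants` could then be reported inline with `` `r n_participants` ``.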
Using the dementia data set, this section looks at which determinants correlate with a high Clinical Dementia Rating (CDR), in other words, which determinants are linked with dementia. A list and explanation of the determinants used in this analysis can be seen in 2.1.
Plots were produced using ggplot2 from the tidyverse package (Wickham et al. 2019). Plots were arranged into a grid using ggarrange from the ggpubr package (Kassambara 2020).
Figure 3.1: Scatter Plots That Demonstrate Correlations Between Determinant And A High CDR
The table was generated using the kableExtra package (Zhu 2020).
| Determinant | CDR | Mean | N | Standard Deviation | Standard Error | Minimum | Maximum |
|---|---|---|---|---|---|---|---|
| Age | 0.0 | 77.1553398 | 206 | 8.0894478 | 0.5636185 | 60.000 | 97.000 |
| Age | 0.5 | 77.4363636 | 110 | 7.3015359 | 0.6961741 | 62.000 | 92.000 |
| Age | 1.0 | 74.3714286 | 35 | 6.8645968 | 1.1603286 | 61.000 | 96.000 |
| Age | 2.0 | 85.0000000 | 3 | 11.2694277 | 6.5064071 | 78.000 | 98.000 |
| MMSE | 0.0 | 29.2233010 | 206 | 0.9205729 | 0.0641394 | 25.000 | 30.000 |
| MMSE | 0.5 | 26.4636364 | 110 | 3.0400304 | 0.2898555 | 17.000 | 30.000 |
| MMSE | 1.0 | 20.3142857 | 35 | 5.2735267 | 0.8913887 | 4.000 | 30.000 |
| MMSE | 2.0 | 20.3333333 | 3 | 5.0332230 | 2.9059326 | 15.000 | 25.000 |
| eTIV | 0.0 | 1486.8592233 | 206 | 179.9986303 | 12.5410988 | 1106.000 | 2004.000 |
| eTIV | 0.5 | 1482.4545455 | 110 | 174.0359889 | 16.5936805 | 1143.000 | 1928.000 |
| eTIV | 1.0 | 1528.0000000 | 35 | 157.8443015 | 26.6805566 | 1274.000 | 1957.000 |
| eTIV | 2.0 | 1538.0000000 | 3 | 157.4452286 | 90.9010451 | 1401.000 | 1710.000 |
| nWBV | 0.0 | 0.7404515 | 206 | 0.0373497 | 0.0026023 | 0.644 | 0.837 |
| nWBV | 0.5 | 0.7205182 | 110 | 0.0345072 | 0.0032901 | 0.646 | 0.806 |
| nWBV | 1.0 | 0.6990571 | 35 | 0.0224564 | 0.0037958 | 0.657 | 0.756 |
| nWBV | 2.0 | 0.7066667 | 3 | 0.0503322 | 0.0290593 | 0.660 | 0.760 |
| ASF | 0.0 | 1.1971068 | 206 | 0.1405721 | 0.0097941 | 0.876 | 1.587 |
| ASF | 0.5 | 1.1995091 | 110 | 0.1365395 | 0.0130185 | 0.910 | 1.535 |
| ASF | 1.0 | 1.1600286 | 35 | 0.1146708 | 0.0193829 | 0.897 | 1.377 |
| ASF | 2.0 | 1.1490000 | 3 | 0.1146865 | 0.0662143 | 1.026 | 1.253 |
| EDUC | 0.0 | 15.1601942 | 206 | 2.7047506 | 0.1884489 | 8.000 | 23.000 |
| EDUC | 0.5 | 14.0090909 | 110 | 3.1781809 | 0.3030277 | 6.000 | 20.000 |
| EDUC | 1.0 | 14.0000000 | 35 | 2.4970571 | 0.4220797 | 8.000 | 20.000 |
| EDUC | 2.0 | 17.0000000 | 3 | 3.0000000 | 1.7320508 | 14.000 | 20.000 |
| SES | 0.0 | 2.3349515 | 206 | 1.0497116 | 0.0731369 | 1.000 | 5.000 |
| SES | 0.5 | 2.6818182 | 110 | 1.2186006 | 0.1161890 | 1.000 | 5.000 |
| SES | 1.0 | 2.5714286 | 35 | 1.2434703 | 0.2101848 | 1.000 | 5.000 |
| SES | 2.0 | 1.6666667 | 3 | 1.1547005 | 0.6666667 | 1.000 | 3.000 |
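The columns of the table above are standard per-group summaries, with the standard error computed as the standard deviation divided by the square root of N. The following base R sketch shows one way such a table could be built; `summarise_by_cdr()` is a hypothetical helper, the toy MMSE values are invented, and the report itself may have used different (for example tidyverse) functions. The last lines cross-check the formula against the Age, CDR 0.0 row of the table.

```r
# Hypothetical helper: per-CDR summary statistics for one determinant
summarise_by_cdr <- function(values, cdr) {
  groups <- split(values, cdr)
  rows <- lapply(names(groups), function(g) {
    x <- groups[[g]]
    data.frame(
      cdr  = as.numeric(g),
      mean = mean(x),
      n    = length(x),
      sd   = sd(x),
      se   = sd(x) / sqrt(length(x)),  # standard error of the mean
      min  = min(x),
      max  = max(x)
    )
  })
  do.call(rbind, rows)
}

# Invented toy data, not the study data
toy <- data.frame(
  cdr  = c(0, 0, 0, 0.5, 0.5, 1, 1),
  mmse = c(30, 29, 28, 27, 25, 20, 22)
)
stats <- summarise_by_cdr(toy$mmse, toy$cdr)

# Cross-check against the table: Age at CDR 0.0 has SD 8.0894478 and N 206
se_age_cdr0 <- 8.0894478 / sqrt(206)
round(se_age_cdr0, 7)  # agrees with the tabulated standard error, 0.5636185
```

Note that N in the table counts rows (visits), so the standard errors for the CDR 2.0 group, with N = 3, are large and should be read with caution.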
Plots were produced using ggplot2 from the tidyverse package (Wickham et al. 2019). Plots were arranged into a grid using ggarrange from the ggpubr package (Kassambara 2020).
Figure 4.1: Boxplots Showing Determinant Data At The Start And End Of The Study In Converted Patients.
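The start-versus-end comparison shown in these boxplots is the kind of paired design for which the Dementia2 data set was prepared. As a minimal sketch, assuming one start and one end measurement per converted subject, both candidate tests can be run with base R's `t.test()` and `wilcox.test()`. The MMSE values below are invented for illustration and are not taken from the study; which test is appropriate depends on whether the paired differences are plausibly normally distributed.

```r
# Invented toy data: one start and one end MMSE score per converted subject
paired <- data.frame(
  subject_id = c("0001", "0002", "0003", "0004", "0005"),
  start      = c(29, 28, 30, 27, 29),
  end        = c(27, 27, 26, 22, 23)
)

# Paired Student's t-test on the within-subject differences (end - start)
t_res <- t.test(paired$end, paired$start, paired = TRUE)

# Paired-samples Wilcoxon (signed-rank) test, the non-parametric alternative
w_res <- wilcox.test(paired$end, paired$start, paired = TRUE)

t_res$p.value
w_res$p.value
```

With very few pairs the exact Wilcoxon p-value cannot become small however consistent the change is, so a non-significant result from a small converted group should not be read as evidence of no change.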
In addition, an LDA model and a questionnaire whose responses are fed into the model can be found here: dementia_grouping_questionnaire.Rmd. The additional packages used there are caret (Kuhn 2020), MASS (Venables and Ripley 2002), shiny (Chang et al. 2020) and shinyforms (Attali, n.d.). An explanation of package use can be found in the linked Rmd file. The model is trained to predict dementia grouping (demented or nondemented).
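For orientation, a linear discriminant analysis of this kind can be sketched with `lda()` from MASS. This is not the model in the linked Rmd file: the training data below is simulated with arbitrary values, the choice of `mmse` and `nwbv` as predictors is an assumption, and the caret, shiny and shinyforms parts of the questionnaire are omitted.

```r
library(MASS)  # provides lda()

# Simulated training data (arbitrary values, for illustration only)
set.seed(1)
n <- 40
train <- data.frame(
  group = factor(rep(c("Nondemented", "Demented"), each = n)),
  mmse  = c(rnorm(n, mean = 29, sd = 1),      rnorm(n, mean = 24, sd = 3)),
  nwbv  = c(rnorm(n, mean = 0.74, sd = 0.03), rnorm(n, mean = 0.71, sd = 0.03))
)

# Fit the discriminant model on the assumed predictors
fit <- lda(group ~ mmse + nwbv, data = train)

# A hypothetical questionnaire response, scored by the fitted model
new_person <- data.frame(mmse = 22, nwbv = 0.70)
pred <- predict(fit, new_person)

pred$class      # predicted grouping
pred$posterior  # posterior probability of each grouping
```

The posterior probabilities are worth reporting alongside the predicted class, since a grouping predicted at close to 0.5 carries much less weight than one predicted at close to 1.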
Issues with the data set: it does not cover all factors (e.g. family history); there are no cases with a CDR of 3.0 or over and only x with a CDR of 2.0; all participants are right-handed; and the sample is quite small.
Dementia2: for the statistical test, visit data was converted to two levels instead of multiple. The analysis therefore only identifies factors that increased risk over time and does not specify how long this takes. More research could be done on this.
Word count is calculated using wordcountaddin (Marwick 2020).
This rmd script: 967
The dementia grouping questionnaire script: 338
The README: 342
Total: 1309